{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Celebrity Quote Analysis with The Azure AI Services"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"https://mmlspark.blob.core.windows.net/graphics/SparkSummit2/cog_services.png\" width=\"800\" style=\"float: center;\"/>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from synapse.ml.services import *\n",
    "from pyspark.ml import PipelineModel\n",
    "from pyspark.sql.functions import col, udf\n",
    "from pyspark.ml.feature import SQLTransformer\n",
    "from synapse.ml.core.platform import find_secret\n",
    "\n",
    "# put your service keys here\n",
    "ai_service_key = find_secret(\n",
    "    secret_name=\"ai-services-api-key\", keyvault=\"mmlspark-build-keys\"\n",
    ")\n",
    "ai_service_location = \"eastus\"\n",
    "bing_search_key = find_secret(\n",
    "    secret_name=\"bing-search-key\", keyvault=\"mmlspark-build-keys\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Extracting celebrity quote images using Bing Image Search on Spark\n",
    "\n",
    "Here we define two Transformers to extract celebrity quote images.\n",
    "\n",
    "<img src=\"https://mmlspark.blob.core.windows.net/graphics/Cog%20Service%20NB/step%201.png\" width=\"600\" style=\"float: center;\"/>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "imgsPerBatch = 10  # the number of images Bing will return for each query\n",
    "offsets = [\n",
    "    (i * imgsPerBatch,) for i in range(100)\n",
    "]  # A list of offsets, used to page into the search results\n",
    "bingParameters = spark.createDataFrame(offsets, [\"offset\"])\n",
    "\n",
    "bingSearch = (\n",
    "    BingImageSearch()\n",
    "    .setSubscriptionKey(bing_search_key)\n",
    "    .setOffsetCol(\"offset\")\n",
    "    .setQuery(\"celebrity quotes\")\n",
    "    .setCount(imgsPerBatch)\n",
    "    .setOutputCol(\"images\")\n",
    ")\n",
    "\n",
    "# Transformer to that extracts and flattens the richly structured output of Bing Image Search into a simple URL column\n",
    "getUrls = BingImageSearch.getUrlTransformer(\"images\", \"url\")"
   ]
  },
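  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To sanity-check these two stages before running the full pipeline, one option is to apply them to a single offset and look at the URLs that come back. This is only a sketch: it assumes the cell above has run and that `bing_search_key` is valid, and `previewUrls` is a name introduced here for illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch: send only the first offset through the search and URL-flattening stages\n",
    "previewUrls = getUrls.transform(bingSearch.transform(bingParameters.limit(1)))\n",
    "\n",
    "# Show a few of the returned image URLs without truncating them\n",
    "previewUrls.select(\"url\").show(5, truncate=False)"
   ]
  },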
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Recognizing Images of Celebrities\n",
    "This block identifies the name of the celebrities for each of the images returned by the Bing Image Search.\n",
    "\n",
    "<img src=\"https://mmlspark.blob.core.windows.net/graphics/Cog%20Service%20NB/step%202.png\" width=\"600\" style=\"float: center;\"/>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "celebs = (\n",
    "    RecognizeDomainSpecificContent()\n",
    "    .setSubscriptionKey(ai_service_key)\n",
    "    .setLocation(ai_service_location)\n",
    "    .setModel(\"celebrities\")\n",
    "    .setImageUrlCol(\"url\")\n",
    "    .setOutputCol(\"celebs\")\n",
    ")\n",
    "\n",
    "# Extract the first celebrity we see from the structured response\n",
    "firstCeleb = SQLTransformer(\n",
    "    statement=\"SELECT *, celebs.result.celebrities[0].name as firstCeleb FROM __THIS__\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Reading the quote from the image.\n",
    "This stage performs OCR on the images to recognize the quotes.\n",
    "\n",
    "<img src=\"https://mmlspark.blob.core.windows.net/graphics/Cog%20Service%20NB/step%203.png\" width=\"600\" style=\"float: center;\"/>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from synapse.ml.stages import UDFTransformer\n",
    "\n",
    "recognizeText = (\n",
    "    RecognizeText()\n",
    "    .setSubscriptionKey(ai_service_key)\n",
    "    .setLocation(ai_service_location)\n",
    "    .setImageUrlCol(\"url\")\n",
    "    .setMode(\"Printed\")\n",
    "    .setOutputCol(\"ocr\")\n",
    "    .setConcurrency(5)\n",
    ")\n",
    "\n",
    "\n",
    "def getTextFunction(ocrRow):\n",
    "    if ocrRow is None:\n",
    "        return None\n",
    "    return \"\\n\".join([line.text for line in ocrRow.recognitionResult.lines])\n",
    "\n",
    "\n",
    "# this transformer wil extract a simpler string from the structured output of recognize text\n",
    "getText = (\n",
    "    UDFTransformer()\n",
    "    .setUDF(udf(getTextFunction))\n",
    "    .setInputCol(\"ocr\")\n",
    "    .setOutputCol(\"text\")\n",
    ")"
   ]
  },
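  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The helper can be exercised without calling the OCR service by passing it a stand-in object with the same attribute layout it reads (`recognitionResult.lines[i].text`). The mock below is hypothetical and mirrors only the fields that `getTextFunction` accesses; it is not a real service response, and it assumes the cell above has already run."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from types import SimpleNamespace\n",
    "\n",
    "# Hypothetical stand-in for one OCR result; only the fields getTextFunction reads\n",
    "mockOcr = SimpleNamespace(\n",
    "    recognitionResult=SimpleNamespace(\n",
    "        lines=[SimpleNamespace(text=\"first line\"), SimpleNamespace(text=\"second line\")]\n",
    "    )\n",
    ")\n",
    "\n",
    "print(getTextFunction(mockOcr))  # the two lines joined by a newline\n",
    "print(getTextFunction(None))  # None: rows without an OCR result pass through"
   ]
  },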
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Understanding the Sentiment of the Quote\n",
    "\n",
    "<img src=\"https://mmlspark.blob.core.windows.net/graphics/Cog%20Service%20NB/step4.jpg\" width=\"600\" style=\"float: center;\"/>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "sentimentTransformer = (\n",
    "    TextSentiment()\n",
    "    .setLocation(ai_service_location)\n",
    "    .setSubscriptionKey(ai_service_key)\n",
    "    .setTextCol(\"text\")\n",
    "    .setOutputCol(\"sentiment\")\n",
    ")\n",
    "\n",
    "# Extract the sentiment score from the API response body\n",
    "getSentiment = SQLTransformer(\n",
    "    statement=\"SELECT *, sentiment.document.sentiment as sentimentLabel FROM __THIS__\"\n",
    ")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Tying it all together\n",
    "\n",
    "Now that we have built the stages of our pipeline it's time to chain them together into a single model that can be used to process batches of incoming data\n",
    "\n",
    "<img src=\"https://mmlspark.blob.core.windows.net/graphics/Cog%20Service%20NB/full_pipe_2.jpg\" width=\"800\" style=\"float: center;\"/>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from synapse.ml.stages import SelectColumns\n",
    "\n",
    "# Select the final coulmns\n",
    "cleanupColumns = SelectColumns().setCols(\n",
    "    [\"url\", \"firstCeleb\", \"text\", \"sentimentLabel\"]\n",
    ")\n",
    "\n",
    "celebrityQuoteAnalysis = PipelineModel(\n",
    "    stages=[\n",
    "        bingSearch,\n",
    "        getUrls,\n",
    "        celebs,\n",
    "        firstCeleb,\n",
    "        recognizeText,\n",
    "        getText,\n",
    "        sentimentTransformer,\n",
    "        getSentiment,\n",
    "        cleanupColumns,\n",
    "    ]\n",
    ")\n",
    "\n",
    "celebrityQuoteAnalysis.transform(bingParameters).show(5)"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  },
  "name": "Cognitive Services on Spark"
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
